Today’s investigation. Human-animal interactions can be defining moments in life. However, when resources become limited, these interactions develop into conflicts where humans compete with, and often displace and reshape, wild populations. This problem has motivated conservation biologists to study wild populations and contribute to solutions to the increasing frequency of human-wildlife conflicts. A fundamental step in studying wild populations from a conservation biology perspective involves describing their size (or age) structure and growth. Today, we will explore how to study some of these population parameters using the normal distribution, a statistical tool widely employed to describe many biological variables, with a dataset of round stingrays from the CSULB Shark Lab.
Introduction
In this lab we will examine the normal distribution and its Z-score standardization to describe the size structure of the round stingray population of Seal Beach, California (Figure 1). The increasing human population of southern California, coupled with the high population density of stingrays, has developed into a human-wildlife conflict resulting in many reported injuries. In order to maintain a balance between healthy fish populations and human recreation, it is crucial to understand key biological processes associated with the distribution of individuals in space. One such process is growth, a parameter defined by the change in size of individuals over time.
Today, we will test whether the size structure of round stingrays in southern California follows a normal distribution or whether it is shifted towards a particular size class. For this, we will use data from the CSULB Shark Lab’s surveys led by Dr. Chris Lowe (Figure 1). In these surveys, Dr. Lowe and colleagues used a large fishing seine (Figure 1, left panel) to capture and measure the body size of live round stingrays. So, let’s explore the normal distribution and how it is a useful statistical tool for describing the proportion of sizes observed in this round stingray population.
Figure 1. Round stingray Urobatis halleri collection and sampling. Stingrays were collected at Seal Beach, California. A 30 m long by 4.5 m tall seine was used to collect stingrays (left panel). The disc width of collected stingrays was measured (right panel). Images: Chris Lowe.
Upon completion of this lab, you should be able to:
External study resources:
References:
Worked example
So far, we have seen several probability distributions. In Chapter 4, we introduced the population and sampling distributions, and in Chapter 5 we discussed the null probability distribution. If we look closer, all the probability distributions discussed so far have a bell shape centered on a mean value.
Let’s explore this type of distribution commonly observed in biological data.
The normal distribution is a continuous probability distribution (see Chapter 4), meaning it describes the probability distribution of a continuous variable. It is symmetric and centered on its mean value. The probability density is highest at the mean and decreases the farther a value is from it.
Say we are interested in describing the diastolic blood pressure of a hypothetical human population of 20,000 adults. After collecting the data, we observe the following distribution of values:
Figure 2. Frequency distribution of diastolic blood pressure for a theoretical human population (n = 20,000; left panel) and the probability density curve (right panel).
These data have a mean of 70 mmHg and a standard deviation of 10 mmHg. Now, say we want to know how well these data fit a normal distribution. So, let’s fit a normal distribution to the data (mean = 70, standard deviation = 10):
Figure 3. Fitted normal distribution (red curve).
In this case, the fitted normal distribution follows the observed data closely, so we can be reasonably confident that our data are normally distributed (Figure 3). This brings us to an important aspect of the normal distribution: it is described by two parameters, the mean \(\mu\) (location) and the standard deviation \(\sigma\) (spread). As we discussed in Chapter 4, the y-axis shows a probability density (not a count). Thus, to get the probability that diastolic blood pressure falls within a range of values, we need to calculate the area under the curve over that range (integration through calculus).
There are common features in any normal distribution. It is continuous and symmetric, it has a single mode and the probability density is highest at the mean. In other words, in a normal distribution, the mean, the median, and the mode, are all equal to each other.
Because of this, we can describe the area under the curve of a normal distribution using the location and the spread (\(\mu\) and \(\sigma\)):
In a normal distribution, 68.3% of the values are within ±1 standard deviation of the mean, while about 95% of them are within ±2 standard deviations of the mean. That is, a randomly chosen observation drawn from a normal distribution has a 68.3% chance of falling between \(\mu - \sigma\) and \(\mu + \sigma\). Similarly, there is approximately a 95% chance that a randomly chosen observation falls within 2 standard deviations of the mean. For our example, 68.3% of the individuals have an expected diastolic blood pressure between 60 and 80 mmHg (mean ± 1 SD) and approximately 95% have an expected diastolic blood pressure between 50 and 90 mmHg (mean ± 2 SD).
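These percentages can be checked numerically with base R's pnorm(), which returns the area under a normal curve to the left of a value. The following is a minimal sketch; the object names (within_1sd, within_2sd, within_60_80, z_95) are illustrative and not part of the lab's script.

```r
# area within 1 and 2 standard deviations of the mean (standard normal)
within_1sd <- pnorm(1) - pnorm(-1)
within_2sd <- pnorm(2) - pnorm(-2)
within_1sd
within_2sd

# same calculation on the blood pressure scale (mean = 70, sd = 10)
within_60_80 <- pnorm(80, mean = 70, sd = 10) - pnorm(60, mean = 70, sd = 10)
within_60_80

# number of standard deviations that enclose exactly 95% of the area
z_95 <- qnorm(0.975)
z_95
```

The area within ±2 standard deviations comes out slightly above 95%, which is why the 95% figure is treated as an approximation.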
The standard normal distribution is a normal distribution of mean = 0 and standard deviation = 1 (Figure 4). A random variable with a standard normal distribution is called Z (or Z-score). A Z-score gives us an idea of how far from the mean a data point is.
Figure 4. The standard normal distribution.
This standardization of the normal distribution is useful because it allows us to (a) calculate the probability of a Z-score occurring and (b) compare two Z-scores from different normal distributions. Before discussing each of these two uses, keep in mind that any normally distributed variable can be standardized to mean = 0 and standard deviation = 1 with the following equation:
\[ \begin{aligned} Z=\frac{Y-\mu}{\sigma} \end{aligned} \]
where Z is a standard normal random variable, Y is an observation, \(\mu\) is the mean, and \(\sigma\) is the standard deviation.
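To see the standardization at work, here is a minimal sketch that simulates a population like the one in the blood pressure example and converts it to Z-scores. The simulated values and the names dbp_sim and z_sim are assumptions made for illustration, and the sample mean and standard deviation stand in for \(\mu\) and \(\sigma\).

```r
# simulate a hypothetical population (illustration only, not real data)
set.seed(42)
dbp_sim <- rnorm(20000, mean = 70, sd = 10)

# standardize: subtract the mean, then divide by the standard deviation
z_sim <- (dbp_sim - mean(dbp_sim)) / sd(dbp_sim)

# after standardization the mean is 0 and the standard deviation is 1
mean(z_sim)
sd(z_sim)
```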
(a) Probability of a Z-score occurring:
A random observation from the standard normal distribution has approximately a 95% chance of falling between -2 and 2 (Figure 4). More exactly, 95% of observations fall within 1.96 standard deviations of the mean (not 2). A statistical table helps us to find the area under the curve, that is, the probability of obtaining a value in a given range (Figure 4). Following our example, say that a new drug for blood pressure is being used. However, this drug has severe adverse effects in individuals with a diastolic blood pressure lower than 40 mmHg or higher than 100 mmHg. So, what is the probability that an adult has a diastolic blood pressure (DBP) lower than 40 mmHg or higher than 100 mmHg?
To answer this, we follow the addition rule for mutually exclusive events:
\[ \begin{aligned} Pr[DBP < 40\ mmHg\ or\ DBP>100\ mmHg]&=Pr[DBP<40\ mmHg]+Pr[DBP>100\ mmHg] \end{aligned} \]
To solve this, the first step is to standardize 100 mmHg to a Z-score:
\[ \begin{aligned} Z=\frac{Y-\mu}{\sigma}=\frac{100-70}{10}=3 \end{aligned} \]
That is, 100 mmHg occurs at 3 standard deviations above the mean. To know Pr[Z > 3] or the area under the standard normal curve for Z values > 3, we use the statistical table which indicates a probability of Pr[Z > 3] = 0.00135. In other words, 0.135% of adults have a diastolic blood pressure higher than 100 mmHg.
Similarly, for 40 mmHg, we standardize it and obtain the area under the curve using the statistical table:
\[ \begin{aligned} Z=\frac{Y-\mu}{\sigma}=\frac{40-70}{10}=-3 \end{aligned} \]
In this case, 40 mmHg occurs at 3 standard deviations below the mean. Keep in mind that the normal distribution is symmetrical and thus the probability of being more than 3 standard deviations below the mean is the same as that of being more than 3 standard deviations above the mean. From the statistical table, that probability is 0.00135.
Thus,
\[ \begin{aligned} Pr[DBP < 40\ mmHg\ or\ DBP>100\ mmHg]&=Pr[DBP<40\ mmHg]+Pr[DBP>100\ mmHg]\\\\ &=0.00135+0.00135\\\\ &=0.0027. \end{aligned} \]
For our example, 0.27% of adults could suffer severe adverse effects from the new drug.
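The same calculation can be done in R instead of with a statistical table. This is a minimal sketch using base R's pnorm(); the object names (mu, sigma, z_high, z_low, p_high, p_low, p_either) are illustrative.

```r
# parameters of the blood pressure example
mu <- 70
sigma <- 10

# standardize the two thresholds
z_high <- (100 - mu) / sigma
z_low <- (40 - mu) / sigma

# upper tail: Pr[Z > z_high]; lower tail: Pr[Z < z_low]
p_high <- pnorm(z_high, lower.tail = FALSE)
p_low <- pnorm(z_low)

# addition rule for mutually exclusive events
p_either <- p_low + p_high
p_either
```

Because the normal distribution is symmetric, p_low and p_high are equal, so the total is twice either tail.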
(b) Comparing two Z-scores from different normal distributions:
Standardization allows us to compare observations from different normal distributions directly, because Z-scores are unitless. Take an observation with a Z-score of 2.7 in one population and an observation with a Z-score of 3.1 in another hypothetical human population. With no other information, we know that the second observation lies 0.4 standard deviations farther from its own mean than the first one (3.1 - 2.7 = 0.4). Hence, we can use standardization to compare how extreme values are relative to the spread of their own distributions.
In Chapter 4, we studied the population and sampling distributions. The former is the whole set of values we are interested in and the latter is the probability distribution of all values for a sample statistic we might obtain when sampling the population. As we usually do not have data for all individuals in a population, we need to estimate the sampling distribution of the statistic of interest. The normal distribution is perfect for that! That is, if variable Y is normally distributed in the population (in our case the diastolic blood pressure), then the distribution of sample means \(\overline{y}\) is also normal (review the central limit theorem in Info-Box!).
In Chapter 4, we also explained that \(\overline{y}\) is an unbiased estimate of \(\mu\). Thus, for our example we already know that the sampling distribution of \(\overline{y}\) should be centered at 70 mmHg. The second parameter is the spread of the curve, which is the error around the mean (or the precision of \(\overline{y}\)): the standard error \(SE_{\overline{y}}\).
Say we are estimating the distribution from a sample of n = 100 individuals. The standard error is:
\[ \begin{aligned} SE_{\overline{y}}=\frac{\sigma}{\sqrt{n}}=\frac{10}{\sqrt{100}}=1\ mmHg \end{aligned} \]
From this equation, it is clear that the spread of the sampling distribution depends on sample size, n. For our example, the normal sampling distribution for \(\overline{y}\) given \(\mu\) and \(\sigma\) is:
Figure 5. Sampling distribution of ȳ.
With this information, we can estimate the probability of randomly choosing a sample with a mean in a given range (remember this is a continuous distribution) using the standard normal distribution. Of course, we would need to know the true mean (\(\mu\)) for this.
For our example, say that we want to estimate the probability of drawing a sample of n = 100 individuals with a mean > 74 mmHg, Pr[\(\overline{y}>74\ mmHg\)].
First, we estimate the Z-score:
\[
\begin{aligned}
Z=\frac{\overline{y}-\mu}{SE_{\overline{y}}}=\frac{74-70}{1}=4
\end{aligned}
\]
From the statistical tables, we find that Pr[Z > 4] = 0.00003. So, about 0.003% of samples of size n = 100 from this population will result in a \(\overline{y}\) equal to or greater than 74 mmHg.
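The steps for the sample mean can also be sketched in R. The object names (mu, sigma, n, se_mean, z_mean_dbp, p_above_74) are illustrative and not part of the lab's script.

```r
# parameters of the blood pressure example and the sample size
mu <- 70
sigma <- 10
n <- 100

# standard error of the sample mean
se_mean <- sigma / sqrt(n)

# Z-score of a sample mean of 74 mmHg
z_mean_dbp <- (74 - mu) / se_mean

# upper tail probability: Pr[Z > z_mean_dbp]
p_above_74 <- pnorm(z_mean_dbp, lower.tail = FALSE)
p_above_74
```

Changing n shows how the standard error shrinks, and the same sample mean becomes more extreme, as sample size grows.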
Info-Box! The more observations we include in each sample, the more the distribution of sample means will tend to a normal distribution. The central limit theorem states that the mean of a large number of measurements randomly sampled from a population is approximately normally distributed, even if the measurement itself is not normally distributed in the population.
For example, in Chapter 4 we estimated the distribution of age at delivery of rhesus macaque females from a sample of n = 500, which clearly is not normally distributed (Chapter 4, Figure 4, see below). However, the sampling distribution of the mean age at delivery is approximately normal.
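The central limit theorem can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: the skewed population is simulated from an exponential distribution (it is not the macaque dataset), and the names pop, sample_means, se_expected and coverage are hypothetical.

```r
# a strongly right-skewed hypothetical population (not normally distributed)
set.seed(1)
pop <- rexp(20000, rate = 1 / 10)

# draw 2000 random samples of n = 50 and keep the mean of each one
n <- 50
sample_means <- replicate(2000, mean(sample(pop, n)))

# the sample means are centered on the population mean ...
mean(pop)
mean(sample_means)

# ... with a spread close to the standard error, sd / sqrt(n)
se_expected <- sd(pop) / sqrt(n)
se_expected
sd(sample_means)

# proportion of sample means within 1.96 standard errors of the population mean
coverage <- mean(abs(sample_means - mean(pop)) < 1.96 * se_expected)
coverage
```

If the sample means are approximately normal, the last proportion should be close to 0.95 even though the population itself is skewed.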
Materials and Methods
Today’s activity, Size structure of round stingrays, is organized into one main exercise with multiple challenges to describe the size structure of round stingrays using the normal distribution. These exercises will also motivate inferences about the usefulness of the Z-score standardization to describe biological parameters.
Many environmental pressures drive populations towards older ages (e.g., trophic shifts) or younger ages (e.g., size-targeted fishing). Understanding these demographic shifts is fundamental to modeling population processes. In their surveys, Dr. Chris Lowe and colleagues measure the body size of round stingrays and use it as a proxy for age. Their results show that southern California offers a warm-water refuge for large Urobatis halleri of sizes associated with reproductive maturity. Let’s explore their data following each step of the Worked example.
1. Research question: What is the size structure of the population of round stingrays in Seal Beach?
Import the “ray” dataset to RStudio and explore it.
Questions:
A. The distribution of sizes.
Let’s first explore the size distribution of round stingrays based on disc width.
# summary of disc width
summary(ray$disc_width)
Challenge: Create two useful data visualizations for these data.
Questions:
Let’s first estimate the statistics mean and standard deviation.
# mean disc width
m <- mean(ray$disc_width)
m
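# standard deviation of disc width
# (the variable sd shares its name with the function sd(); R still finds the function when it is called)
sd <- sd(ray$disc_width)
sd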
Using the two statistics needed to define a normal distribution, let’s fit one in ggplot2 using the function stat_function(). The first argument of the function is the function to plot, which in this case is the normal probability density (dnorm), followed by n, the number of points at which the curve is evaluated (set here to 2427, the sample size), and a list of arguments for the mean and standard deviation.
# load ggplot2 for plotting
library(ggplot2)
# plotting the probability density of disc width
# (inside aes(), refer to the column by name rather than with ray$)
p1 <- ggplot(ray, aes(x = disc_width)) +
  geom_density()
p1
# fitting a normal distribution
p2 <- p1 + stat_function(fun = dnorm, n = 2427, args = list(mean = m, sd = sd),colour="red")
p2
Questions:
B. The standard normal distribution for size.
From the equation in Step 3 from the Worked example, we can easily standardize disc width. For this, let’s create a new column in our dataframe ray and add the standardized values. We use the “$” sign to create a new column. Let’s call this new column “z”.
# estimating Z for disc width
ray$z <- (ray$disc_width-m)/sd
# checking the new column in "ray"
head(ray)
# plotting z
# (inside aes(), refer to the column by name rather than with ray$)
p3 <- ggplot(ray, aes(x = z)) +
  geom_density()
p3
Now that we have standardized disc width, we can estimate the probability that a captured round stingray has a size within a particular range of values. Say we want to estimate the probability that a stingray has a disc width of 13 cm or smaller, which is the maximum size for stingrays in the “small” size class.
Stop, Think, Do: What is the probability that a captured round stingray has a disc width of 13 cm or smaller? Stop and review Step 3 of the Worked example. Think about the question and formulate it mathematically. Do a formal presentation of the problem using proper nomenclature and solve it.
Now, let’s learn how to do this in R. For this, we use the function pnorm() which gives us the probability of getting a value equal to or less than Y under the normal curve. The first argument of the function is Y (in our case this value is 13), followed by the mean and standard deviation.
# probability of getting a disc width equal or less than 13 cm
pnorm(13, mean = m, sd = sd)
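The two routes, pnorm() on the original scale and pnorm() on the Z-score, give the same answer. The sketch below shows this with simulated data, because it has to run without the lab's file: the data frame ray_sim, its values (mean 16, SD 3), and the names m_sim, s_sim, p_direct and p_from_z are assumptions for illustration only and do not describe the real “ray” dataset.

```r
# hypothetical disc widths in cm (simulated, not the Shark Lab data)
set.seed(7)
ray_sim <- data.frame(disc_width = rnorm(500, mean = 16, sd = 3))

# sample mean and standard deviation
m_sim <- mean(ray_sim$disc_width)
s_sim <- sd(ray_sim$disc_width)

# route 1: pnorm() on the original scale
p_direct <- pnorm(13, mean = m_sim, sd = s_sim)

# route 2: standardize 13 cm first, then use the standard normal curve
z_13 <- (13 - m_sim) / s_sim
p_from_z <- pnorm(z_13)

# both routes return the same probability
p_direct
p_from_z
```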
C. The sampling distribution of mean size.
Assuming \(\overline{y}\) is a good representation of \(\mu\) and s is a good representation of \(\sigma\), we can use the Z-score to estimate the probability of obtaining a sample with a mean in a particular range of values. Say that the Shark Lab team is pretty confident that \(\overline{y}\) is close to \(\mu\) given the large dataset they have analyzed. If they go back to the field and randomly sample 10 stingrays, what is the probability that the observed mean disc width will be below 15 cm, raising questions about a potential shift in the size structure of stingrays?
But first, let’s estimate \(SE_\overline{y}\).
# standard error of the sample mean
se <- sd/sqrt(10)
se
# Z-score for the sample mean
z_mean <- (15-m)/se
z_mean
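Once the Z-score of the sample mean is known, pnorm() converts it into a probability. The sketch below uses made-up values so that it runs on its own: m_hyp = 16 cm and s_hyp = 3 cm are hypothetical placeholders, not estimates from the “ray” dataset, and the other object names are illustrative.

```r
# hypothetical mean and standard deviation of disc width (placeholders)
m_hyp <- 16
s_hyp <- 3

# standard error for a sample of 10 stingrays
se_hyp <- s_hyp / sqrt(10)

# Z-score for a sample mean of 15 cm
z_hyp <- (15 - m_hyp) / se_hyp

# lower tail probability: Pr[sample mean < 15 cm]
p_below_15 <- pnorm(z_hyp)
p_below_15
```

With the real data, the objects m, sd and z_mean computed in the lab script take the place of these placeholders.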
Questions:
Stop, Think, Do: using the dataset “ray”, estimate the probability that a randomly sampled round stingray in Seal Beach is small, medium or large. Stop and review the variables in the dataset. Think about how to obtain a probability for a range of values under a normal curve. Do the analysis in R and present it!
Take-home exercise
Task. Does the size structure of round stingrays at Seal Beach, California, change annually? Following today’s exercise, do separate analyses for each year available. Hint: use levels(as.factor(ray$year)) to list the years of data collection. Present the observed and the normally fitted size structure for each year. Create an annotated R script that includes the following steps: (1) data import, (2) package loading, and (3) code for each step in the analysis, including figures. Accompany your R script with a concluding paragraph summarizing and interpreting your results. Your assignment should include two files: (1) a .doc file with the figures and paragraph, and (2) an .R file containing your annotated R script. Stick to the question and do not add unnecessary steps to your R script.
Guided questions for writing the concluding paragraph: